Big is beautiful: Bootstrapping a PoS tagger for Swedish
Abstract
A statistical part-of-speech tagger trained on a one-million-word Swedish corpus with validated tags was used to tag two considerably larger untagged corpora (≈ 78 million and 20 million words, respectively), and the automatically tagged material was then used to bootstrap new, improved tagger models. The new taggers all showed better accuracy for both seen and unseen words, and the best tagger reached 97.02% overall accuracy when evaluated on the original corpus using 10-fold cross-validation.
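The bootstrapping idea in the abstract can be illustrated with a minimal sketch. It is not the tagger used in the paper: the model below is a toy lookup tagger (most frequent tag per word, with a two-letter suffix fallback for unseen words), and the function names `train`, `tag`, and `bootstrap` are hypothetical, chosen only for this illustration. It assumes a non-empty seed corpus of sentences given as lists of `(word, tag)` pairs.

```python
from collections import Counter, defaultdict


def train(tagged_sentences):
    """Count tags per word and per two-letter suffix (toy model, an assumption)."""
    word_tags = defaultdict(Counter)
    suffix_tags = defaultdict(Counter)
    all_tags = Counter()
    for sentence in tagged_sentences:
        for word, tag_label in sentence:
            w = word.lower()
            word_tags[w][tag_label] += 1
            suffix_tags[w[-2:]][tag_label] += 1
            all_tags[tag_label] += 1
    return word_tags, suffix_tags, all_tags


def tag(model, sentence):
    """Tag a list of words: known word, else suffix guess, else most common tag."""
    word_tags, suffix_tags, all_tags = model
    default = all_tags.most_common(1)[0][0]
    tagged = []
    for word in sentence:
        w = word.lower()
        if w in word_tags:
            label = word_tags[w].most_common(1)[0][0]
        elif w[-2:] in suffix_tags:
            label = suffix_tags[w[-2:]].most_common(1)[0][0]
        else:
            label = default
        tagged.append((word, label))
    return tagged


def bootstrap(seed_sentences, untagged_sentences):
    """Train on the seed corpus, auto-tag the untagged corpus, retrain on both."""
    base_model = train(seed_sentences)
    auto_tagged = [tag(base_model, s) for s in untagged_sentences]
    return train(seed_sentences + auto_tagged)
```

In this sketch, a word that occurs only in the untagged corpus is unseen for the base model but becomes a lexicon entry in the bootstrapped model, which is one simple way a larger, automatically tagged corpus can help with previously unseen words.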
Similar resources
Size is not Everything: Genre Balance in Bootstrapping a Swedish PoS Tagger
Part-of-speech tagging is a basic component of natural language processing and, as such, needs to be as accurate as possible, or any subsequent processing will suffer. For Swedish, most tagger models are trained on the Stockholm-Umeå Corpus (SUC; Ejerhed et al., 2006). As SUC is a balanced corpus, SUC models are better representatives for general language than models trained on news texts only, ...
Extending the View: Explorations in Bootstrapping a Swedish PoS Tagger
State-of-the-art statistical part-of-speech taggers mainly use information on tag bi- or trigrams, depending on the size of the training corpus. Some also use lexical emission probabilities above unigrams with beneficial results. In both cases, a wider context usually gives better accuracy for a large training corpus, which in turn gives better accuracy than a smaller one. Large corpora with vali...
A comparative study of the effect of part-of-speech tagging on parsing in automatic processing of the Persian language
In this paper, the role of Part-of-Speech (POS) tagging for parsing in automatic processing of the Persian language is studied. To this end, the impact of the quality of POS tagging as well as the impact of the quantity of information available in the POS tags on parsing are studied. To reach the goals, three parsing scenarios are proposed and compared. In the first scenario, the parser assigns...
Stagger: A modern POS tagger for Swedish
The field of Part of Speech (POS) tagging has made slow but steady progress during the last decade, though many of the new methods developed have not previously been applied to Swedish. I present a new system, based on the Averaged Perceptron algorithm and semi-supervised learning, that is more accurate than previous Swedish POS taggers. Furthermore, a new version of the Stockholm-Umeå Corpus i...
Some applications of a statistical tagger for Swedish
We will briefly describe a part-of-speech (POS) tagger for Swedish and discuss some applications: rule-based and probabilistic grammar checking, word prediction and keyword extraction. In POS tagging of a text, each word and punctuation mark in the text is assigned a morphosyntactic tag. We have designed and implemented a tagger based on a second order Hidden Markov Model [1]. Given a sequence o...
Publication date: 2006